
celeris v1.5.3 — core-engine performance milestone#384

Merged
FumingPower3925 merged 43 commits into main from milestone/v1.5.3 on Jun 21, 2026

@FumingPower3925

Release PR for v1.5.3, the core-engine performance milestone (io_uring / epoll / adaptive).

Release-prep (this PR's tail)

  • Version 1.5.0 → 1.5.3 (was stale; even v1.5.2 shipped "1.5.0").
  • SECURITY.md updated for the 1.5.x supported line + a v1.5.x Security Improvements section (io_uring gen-tagged-CQE UAF hardening, H1 request-smuggling hardening, cpuMon + epoll-detach data-race fixes, middleware/secure COEP/X-Download-Options default-off behavior change, Go 1.26.4 toolchain bump).
  • Four middleware submodule require pins bumped v1.4.4 → v1.5.3.
  • probe: moved the CAP_SYS_NICE side-effect test under the linux build tag so the package compiles on darwin (local go test ./... was broken; CI unaffected).

Milestone highlights (the 38 commits)

Adaptive kernel/feature-gated start engine + lazy standby, effective AsyncHandlers for drivers + immediate-promote, dynamic-worker-scaler removal, io_uring/epoll memory-safety + race hardening, H1 parser hardening.

Merges fast-forward onto main (0 conflicts). After merge, the release is cut via `gh release create v1.5.3`, which fires release.yml (validate-tag → ci → tag-submodules → notify-proxy).

…nd was dead)

Three wrong io_uring ABI constants silently disabled zero-copy send on every
kernel:
- opSENDZC was 53 (IORING_OP_FUTEX_WAITV); IORING_OP_SEND_ZC is 47. SEND_ZC
  SQEs carried the wrong opcode → kernel returned -EINVAL on the probe →
  SEND_ZC reported unsupported and was disabled everywhere.
- cqeFNotif was 1<<2 (0x04 = IORING_CQE_F_SOCK_NONEMPTY); IORING_CQE_F_NOTIF
  is 1<<3 (0x08). Both the probe and the runtime notification handler
  (cqeIsNotif → handleSend) misread the zero-copy notification CQE.
- probeSendZC hardcoded the 0x04 NOTIF check; now uses cqeFNotif.

Also corrects the unused opSHUTDOWN constant (was 52=FUTEX_WAKE; IORING_OP_
SHUTDOWN is 34) to prevent the same class of bug if it is ever used.
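
A minimal sketch of the corrected constants, using only the values quoted in this message. The `cqeIsNotif` signature (taking raw flags) and the `cqeFSockNonEmpty` name are assumptions for illustration; the real helper may take a CQE struct.

```go
package main

import "fmt"

// Values as stated in the commit message above.
const (
	opSHUTDOWN = 34 // IORING_OP_SHUTDOWN (52 is IORING_OP_FUTEX_WAKE)
	opSENDZC   = 47 // IORING_OP_SEND_ZC (53 is IORING_OP_FUTEX_WAITV)

	cqeFSockNonEmpty = 1 << 2 // IORING_CQE_F_SOCK_NONEMPTY (0x04), hypothetical name
	cqeFNotif        = 1 << 3 // IORING_CQE_F_NOTIF (0x08)
)

// cqeIsNotif reports whether the completion flags mark the zero-copy
// notification CQE. Hypothetical signature for this sketch.
func cqeIsNotif(flags uint32) bool {
	return flags&cqeFNotif != 0
}

func main() {
	fmt.Println(cqeIsNotif(0x08)) // notification CQE
	fmt.Println(cqeIsNotif(0x04)) // socket-nonempty hint only
}
```

Sharing `cqeFNotif` between the probe and the runtime handler, instead of a hardcoded literal, keeps the two checks from drifting apart again.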

Verified on the cluster (kernel 7.0.0-22): the SEND_ZC probe now reports
"true zero-copy" and the engine selects send_zc=true (previously
send_zc=false / 'kernel rejected SEND_ZC opcode'). Note: perf-neutral on the
current HTTP benchmark suite (interleaved A/B: get-json +0.4%, get-json-64k
+0.0%; small responses do not benefit from ZC, large ones are bandwidth-bound),
but it is a genuine correctness fix that enables io_uring's zero-copy path and
may help CPU-copy-bound workloads.

Refs #356
Under Config.AsyncHandlers=true, a route that inherits the server/group default
(not an explicit .Async()) is now ADAPTIVE: it runs INLINE on the event-loop
worker (the ring-batched send path, which gets io_uring's ~33x syscall
reduction) and is promoted to async dispatch only when an inline run is
observed to block (>50us). Trivial handlers thus keep the cheap inline path;
genuinely-blocking handlers (DB/cache round-trips) still get goroutine
isolation after one inline run.

Why: the shipped iouring-h1-async config dispatched EVERY request to a
goroutine that did a direct write(2) per response, bypassing the ring (~42% CPU
in unix.Write) — making it slower than epoll. Measured: on CPU-bound chain
cells the ring path beats epoll +2.4..6.7%; this lets the async config reach
that path for non-blocking handlers.

- router: adaptiveRoutes set (built at registration) + promoted sync.Map;
  routeAsync returns async || (adaptive && promoted). Explicit .Async()/.Sync()
  removes the route from the adaptive set (setAsync).
- handler: HandleStream times an inline adaptive run and promotes on block.
- Non-adaptive configs (no AsyncHandlers default) keep the empty-map fast path,
  zero added overhead.

Contract change (reflected in router_async_test.go): AsyncHandlers=true now
means inline-first-adaptive, not all-async. Explicit .Async() is unchanged.
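
The routing predicate above can be sketched as follows. This is an illustration under assumptions, not the router's actual code: the `router` struct, its string keys, and `newRouter` are hypothetical; only the names `adaptiveRoutes`, `promoted`, `routeAsync`, and `setAsync` come from this message.

```go
package main

import (
	"fmt"
	"sync"
)

// router is a hypothetical stand-in that models only the sets named above.
type router struct {
	async          map[string]bool // routes with an explicit .Async()
	adaptiveRoutes map[string]bool // routes inheriting the AsyncHandlers default
	promoted       sync.Map        // adaptive routes observed to block
}

// newRouter registers the given routes as inheriting the async default.
func newRouter(inherited ...string) *router {
	r := &router{async: map[string]bool{}, adaptiveRoutes: map[string]bool{}}
	for _, route := range inherited {
		r.adaptiveRoutes[route] = true
	}
	return r
}

// routeAsync mirrors the quoted predicate: async || (adaptive && promoted).
func (r *router) routeAsync(route string) bool {
	if r.async[route] {
		return true
	}
	if !r.adaptiveRoutes[route] {
		return false
	}
	_, ok := r.promoted.Load(route)
	return ok
}

// setAsync models an explicit .Async()/.Sync(): the route leaves the
// adaptive set and its async state becomes fixed.
func (r *router) setAsync(route string, async bool) {
	delete(r.adaptiveRoutes, route)
	r.promoted.Delete(route)
	r.async[route] = async
}

func main() {
	r := newRouter("/json")
	fmt.Println(r.routeAsync("/json")) // inline first
	r.promoted.Store("/json", struct{}{})
	fmt.Println(r.routeAsync("/json")) // promoted after blocking
}
```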

Refs #356
…ode-fix

fix(iouring): correct SEND_ZC opcode + CQE_F_NOTIF flag — zero-copy send was dead on every kernel
…nline

feat(core): adaptive inline-first dispatch under AsyncHandlers (#356) — iouring-async +7.5..48% on CPU-bound
…routes

Blocking driver routes (probatorium /cache, /db, /mc) register with an
explicit .Async() on an async-default server. setAsync drops them from
the adaptive set, so handler.go's inline-first gate skips them and they
resolve hard-async — no inline window, no worker stall. Pin that
contract so the driver no-regression guarantee can't silently break.
parseHeaders already detects an incomplete header block (a final line with
no CRLF yields lineEnd==-1 -> (false,nil)), and every ParseRequest caller
Reset()s parser+req first, so a partial parse is always retried cleanly.
The upfront findHeaderEnd CRLF-walked the same bytes parseHeaders walks
again -- pure double work on the common single-read request. Slow-drip
re-parse is bounded by MaxHeaderSize(64K)/MaxHeaderCount(200)/ReadHeaderTimeout,
matching net/http and fasthttp which also re-parse on partial reads.
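
To illustrate why a separate upfront scan is redundant, here is a simplified line walker that reports an incomplete block on its own. It is a sketch, not the celeris parser: the map output, `errMalformedHeader`, and `parseString` are hypothetical, and size limits are omitted.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

var errMalformedHeader = errors.New("malformed header line")

// parseHeaders walks buf line by line. It returns (true, nil) at the empty
// line ending the header block, (false, nil) when the buffer ends before
// that (the caller reads more and re-parses from scratch), and an error for
// a line without a field name and colon.
func parseHeaders(buf []byte, out map[string]string) (bool, error) {
	for {
		lineEnd := bytes.Index(buf, []byte("\r\n"))
		if lineEnd == -1 {
			return false, nil // final line has no CRLF yet: incomplete
		}
		if lineEnd == 0 {
			return true, nil // blank line terminates the block
		}
		line := buf[:lineEnd]
		colon := bytes.IndexByte(line, ':')
		if colon <= 0 {
			return false, errMalformedHeader
		}
		out[string(line[:colon])] = string(bytes.TrimSpace(line[colon+1:]))
		buf = buf[lineEnd+2:]
	}
}

// parseString parses into a fresh map, mirroring a Reset() before each attempt.
func parseString(s string) (map[string]string, bool, error) {
	hdrs := map[string]string{}
	done, err := parseHeaders([]byte(s), hdrs)
	return hdrs, done, err
}

func main() {
	_, done, err := parseString("Host: a\r\nAccept: */*")
	fmt.Println(done, err) // partial read: not done, no error
	_, done, err = parseString("Host: a\r\nAccept: */*\r\n\r\n")
	fmt.Println(done, err) // full block
}
```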

Correctness: full h1 race suite + 46M fuzz execs (ParseRequest+ChunkedBody)
pass; findHeaderEnd retained (asm + its own tests).
…r-scan

perf(h1): drop redundant upfront findHeaderEnd block scan (#359)
When a middleware stack pushes respHeaders past the inline 16-slot
respHdrBuf (chain-fullstack: 17 headers), append() moved it onto a heap
array that the reset then dropped, forcing a fresh ~576B alloc on every
header-heavy request. Retain it as respHdrScratch and reuse it: one alloc
per pooled Context, not per request. Also folds SetResponseHeaders'
>16-header path onto the scratch and removes the old clear(respHdrBuf[:n>16])
clamp foot-gun (the cap check is the correct discriminator).
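
A minimal sketch of the retain-and-reuse idea, assuming a simplified `Context` with hypothetical `addHeader` and `reset` methods (the field names `respHdrBuf`, `respHeaders`, and `respHdrScratch` come from this message; everything else is illustrative). It needs Go 1.21+ for `clear`.

```go
package main

import "fmt"

const inlineHdrSlots = 16

// Context is a simplified stand-in for the pooled request context.
type Context struct {
	respHdrBuf     [inlineHdrSlots][2]string // inline storage
	respHeaders    [][2]string               // live header list
	respHdrScratch [][2]string               // retained heap backing after an overflow
}

func newContext() *Context {
	c := &Context{}
	c.respHeaders = c.respHdrBuf[:0]
	return c
}

// addHeader appends a header. When the inline array is full and a retained
// scratch exists, the list moves onto the scratch instead of allocating.
func (c *Context) addHeader(k, v string) {
	onInline := cap(c.respHeaders) == inlineHdrSlots
	if onInline && len(c.respHeaders) == inlineHdrSlots && cap(c.respHdrScratch) > inlineHdrSlots {
		c.respHeaders = append(c.respHdrScratch[:0], c.respHeaders...)
	}
	c.respHeaders = append(c.respHeaders, [2]string{k, v})
}

// reset clears the headers and keeps any heap backing for the next request.
// The cap check decides whether the list had left the inline array.
func (c *Context) reset() {
	clear(c.respHeaders)
	clear(c.respHdrBuf[:])
	if cap(c.respHeaders) > inlineHdrSlots {
		c.respHdrScratch = c.respHeaders[:0]
	}
	c.respHeaders = c.respHdrBuf[:0]
}

func main() {
	c := newContext()
	for i := 0; i < 17; i++ {
		c.addHeader("k", "v")
	}
	first := &c.respHeaders[0]
	c.reset()
	for i := 0; i < 17; i++ {
		c.addHeader("k", "v")
	}
	fmt.Println(first == &c.respHeaders[0]) // same heap array on the second request
}
```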

NOTE: rps-neutral on the cluster (one small alloc/req was not the
chain-fullstack bottleneck); this is a GC/RSS-hygiene + foot-gun-removal
change. Verified 0 allocs/op (TestContextRespHeaderOverflowReuseZeroAlloc)
+ full root race suite + overflow regression guard.
#361)

The #356 classifier timed every inline run of an adaptive route (two
time.Now() vDSO calls + recordInlineRun) forever. A profile of
iouring-h1-async chain-fullstack showed time.runtimeNow at 3.22% CPU on
routes that never block. Add a settled-fast terminal state: after
adaptiveSettleStreak (256) CONSECUTIVE fast inline runs a route is proven
non-blocking and leaves the timed path (adaptiveLearning short-circuits on
a single settled sync.Map.Load — the same lookup the prior isPromoted
check cost, minus the two time.Now()). A slow run resets the fast streak;
explicit .Async()/.Sync() clears settled (setAsync).
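
The settled-fast state can be sketched as a small per-route state machine. This is a single-goroutine illustration under assumptions: the `routeState` struct and its fields are hypothetical, and the real code keeps this state in concurrent structures (a sync.Map lookup is mentioned above).

```go
package main

import "fmt"

const adaptiveSettleStreak = 256 // value stated in the commit message

// routeState is a hypothetical per-route record for this sketch.
type routeState struct {
	fastStreak int
	settled    bool
}

// recordInlineRun updates the streak after a timed inline run.
// fast reports whether the run stayed under the promote threshold.
func (s *routeState) recordInlineRun(fast bool) {
	if s.settled {
		return
	}
	if !fast {
		s.fastStreak = 0 // a slow run resets the consecutive-fast count
		return
	}
	s.fastStreak++
	if s.fastStreak >= adaptiveSettleStreak {
		s.settled = true // proven non-blocking: stop timing this route
	}
}

// adaptiveLearning reports whether the next inline run still needs timing.
func (s *routeState) adaptiveLearning() bool {
	return !s.settled
}

func main() {
	s := &routeState{}
	for i := 0; i < adaptiveSettleStreak; i++ {
		s.recordInlineRun(true)
	}
	fmt.Println(s.adaptiveLearning()) // settled, so no more timing
}
```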

A/B (iouring-h1-async, interleaved): chain-api +1.4% (both rounds),
get-json +0.4%. Full root race suite + new TestRouteAsync_AdaptiveSettles
+ existing #356 hysteresis/promotion tests pass.
perf(core): settle non-blocking adaptive routes to stop per-req timing (#361)
The 50µs bar was below the CPU-jitter range: a transient GC/scheduling
burst could push a CPU-bound middleware chain (chain-fullstack, ~20µs
base) over 50µs for 8 consecutive runs and wrongly promote it to the
slower async path, intermittently collapsing iouring-async chain-fullstack
by ~32% for a whole run (#364). 300µs sits far above the CPU-bound range
(even a heavy chain under jitter) while still catching genuinely-blocking
handlers (sub-ms+) — which are marked .Async() in practice anyway.

Hardening, not a proven fix: the collapse is rare (<10%, did not reproduce
in 10 fresh-SUT runs) so elimination can't be directly measured. rps-neutral
within the ~±2% A/B noise floor (chain-fullstack/api/get-json). Inherited
adaptive routes are CPU-bound (never legitimately promote); the only
behavior change is a moderate-latency unmarked-blocking route promotes
later — such routes should use explicit .Async().
…reshold

fix(core): raise adaptive promote threshold 50µs->300µs (#364 hardening)
…-reuse

perf(core): reuse heap respHeaders backing across requests (#360)
ResetH1Stream re-zeroed Headers/LazyRawHeaders/Method/Path/Scheme/Authority/
IsHEAD/EndStream/ResponseWriter + an atomic state.Store(StateIdle) that
populateCachedStream + handleH1Request unconditionally overwrite on the next
request (h1.go:852-941, single caller) — pure dead work + a redundant atomic
on every keep-alive request. Keep only the fields the caller does NOT redo:
rawBody=nil (load-bearing — a bodyless GET must not inherit a prior body),
lazyHeadersBuilt, pseudoMaterialized, headersSent.

Verified: caller sets all 10 dropped fields; full h1/stream/internal-conn
race suites + h1 conformance pass (a stale-state leak would surface across
keep-alive requests); new TestResetH1StreamClearsUniqueFields guards the
rawBody contract; 0 allocs/op.
perf(h1): drop dead-store writes in ResetH1Stream (#346)
ioUringBias is a heuristic that estimates io_uring's advantage from connection
count + CPU pressure but NEVER reads the standby engine's measured throughput.
Ungated, biasModeledStandbyScore could fabricate a standby score high enough to
switch adaptive onto an engine that is measurably SLOWER on the live workload
(e.g. epoll-favored get-simple). Gate it behind CELERIS_ADAPTIVE_IOURING_BIAS
(default off): with the bias off, bias=0 → the modeled standby never exceeds the
active score → adaptive switches only on MEASURED active degradation vs a
previously-observed standby, never speculatively onto an unmeasured/slower
engine. The speculative bias stays opt-in for re-validation.

Tests: TestControllerOrganicSwitch now forces biasEnabled=true (the bias is
opt-in); new TestControllerNoSpeculativeSwitchBiasOff asserts the default-off
no-switch on the same sweet-spot workload; score_test uses the 2-arg form. Full
adaptive suite passes on linux/amd64 (cluster).
…lt-off

fix(adaptive): gate io_uring bias off by default (#341)
Set TCP_NODELAY once on the listen socket (Linux copies it onto every accepted
socket at SYN time) and drop it from the per-accept sockopts.Options, removing
one setsockopt syscall per accept on the hot path. rps-neutral on the cluster
(churn-close within the ~±2% A/B noise; the bench is NIC-bound) — a syscall-count
efficiency win, not a throughput change.

Verified: full epoll suite passes; new linux tests assert the listen socket has
TCP_NODELAY AND accepted conns inherit it (guards that removing the per-accept
setsockopt does NOT silently re-enable Nagle on the hot path).
perf(epoll): inherit TCP_NODELAY from the listen socket (#337)
SEND_ZC adds a second (NOTIF) CQE per send and holds zcNotifPending across the
buffer's DMA lifetime, stalling the next flush — a net loss on small payloads
where the avoided memcpy is tiny. Gate it behind sendZCMinBytes=4096 via a
single useSendZC(sendZC, linked, n) helper used at all four send sites (highTier
+ optionalTier PrepareSend, worker prepSendSQE): small/linked sends use plain
SEND (1 CQE, immediate buffer reuse), large unlinked sends still use ZC.
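
The gate described above reduces to a single predicate. The sketch below assumes the helper's body from the stated rules (ZC supported, send not linked, payload at least `sendZCMinBytes`); the actual implementation is not shown in this message.

```go
package main

import "fmt"

const sendZCMinBytes = 4096 // threshold stated in the commit message

// useSendZC decides between SEND_ZC and plain SEND for one send.
// sendZC: the engine selected zero-copy; linked: the SQE is part of a link
// chain; n: payload size in bytes.
func useSendZC(sendZC, linked bool, n int) bool {
	return sendZC && !linked && n >= sendZCMinBytes
}

func main() {
	fmt.Println(useSendZC(true, false, 512))   // small payload: plain SEND
	fmt.Println(useSendZC(true, false, 65536)) // large unlinked payload: ZC
	fmt.Println(useSendZC(true, true, 65536))  // linked sends never use ZC
}
```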

rps-neutral on the cluster (get-json/get-simple within ±2% noise — NIC-bound;
get-json-64k confirms ZC still chosen for >=4096, no regression) → removes one
CQE/req on small async sends (efficiency), not a throughput change. Link
invariant preserved (linked sends never ZC). Full iouring suite + new gating
tests (TestUseSendZC, TestPrepSendSQEGatesBySize, TestPrepSendSQELinkedNeverZC)
pass on linux/amd64.
…e-gate

perf(iouring): gate SEND_ZC by payload size (#332)
#341 made the io_uring bias safe by DISABLING it, but that left adaptive
parked on epoll at high concurrency (it starts on epoll), missing io_uring's
measured +6.8% @1024c. Restore the win SAFELY by making the bias reversible:

  - the ACTIVE score is always the pure measurement (no reinforce/penalty), so
    leaving an engine is decided measured-vs-measured;
  - biasModeledStandbyScore boosts ONLY the io_uring standby (EXPLORE) and
    returns 0 for the epoll standby (never models it down) — so a wrongly-
    explored io_uring always REVERTS on measurement;
  - history records the unbiased measured score.

Net: adaptive explores io_uring when the workload model favors it, keeps it
only if it measures faster, and reverts otherwise. The 15% switch threshold +
oscillation lock provide hysteresis (no thrash). Safe ON by default;
CELERIS_ADAPTIVE_IOURING_BIAS=0 forces the conservative measurement-only
controller (supersedes #341's default-off).

Validated: full adaptive suite on linux/amd64 incl new explore/revert/
kill-switch/stability-under-fluctuation tests; cluster end-to-end confirmed —
adaptive explored epoll→io_uring under sustained 1024c (892k→io_uring) and
reverted to epoll when load stopped (0 throughput). Switch latency ~30s
(deliberate observe-before-act; sustained-load win, tuning follow-up).
…e-bias

feat(adaptive): reversible io_uring bias, default on (#338)
Blob assembled its response header list (content-type + content-length +
user headers) via make([][2]string, 0, total) on EVERY response whose total
exceeds respHdrBuf's 16 slots. An allocation profile of chain-fullstack (18
headers) showed this was the DOMINANT per-request alloc — ~77% of all
allocations, ~1.16 GB/s → GC pressure → the throughput cost. (get-json and
other <=14-user-header responses already hit the alloc-free inline fast path
and are unaffected.)

Reuse a per-Context blobHdrScratch (alloc once per pooled Context, not per
request), mirroring respHdrScratch (#360). respHeaders never aliases it
(separate buffer; the append copies the [2]string values).

A/B (interleaved, 2 rounds, vs baseline): chain-fullstack +4.4% (iouring-h1-
async) / +5.0% (epoll-h1-sync); get-json neutral (-0.2%/+0.1%, control — it
never enters this path). Full root race suite + new
TestContextBlobManyHeadersZeroAlloc (0 allocs/op) pass.
…ratch

perf(core): reuse blobHdrScratch for >16-header responses
…s off (#338)

BEHAVIOR CHANGE. secure.New() no longer emits Cross-Origin-Embedder-Policy
(require-corp) or X-Download-Options (noopen) by default — both are now opt-in.

COEP=require-corp by default is a footgun: it blocks cross-origin resources
(images/scripts without CORP/CORS), silently breaking many sites — the config's
own comment warned about it, yet it was on by default. Helmet leaves COEP off
for this reason; we now match. X-Download-Options only ever affected legacy IE
and is obsolete. Set either field explicitly to re-enable.

Default header count 11 -> 9 (HSTS still runtime-gated to HTTPS). Beyond the
footgun fix, the smaller response flips chain-security to a WIN vs fasthttp
(-1.2% -> +0.7%) and improves chain-fullstack (-6.0% -> -4.9%); chain-api
(no secure mw) unchanged. secure suite + middleware + conformance pass; new
coep/x-download opt-in test cases added.
fix(secure): default COEP + X-Download-Options off (opt-in) (#338)
…cv bodies

A keep-alive connection that handled even one fixed-length body split
across recvs was permanently promoted to the async dispatch goroutine
(worker.go HasPendingData gate), then served every subsequent request via
a blocking unix.Write + cross-goroutine condvar handoff instead of the
inline io_uring linked SEND. Under sustained small-POST load ~11% of
requests split, so essentially every long-lived conn was poisoned onto
the slow path within its first few requests.

A fixed-length body in progress resumes via the inline re-parse path
(ProcessH1 bodyNeeded>0), which runs on the worker that owns h1State and
is already async-checked (provably non-async) — exactly like the sync
engine. Only buffered partial headers / chunked bodies genuinely need the
InlineMode=false dispatch path. Split the gate: HasPendingDispatchState
promotes for buffered-headers/chunked only, never for a fixed body.

Also tighten pickRecvTarget: gate the zero-copy direct-into-bodyBuf recv
bail on (w.async && cs.asyncPromoted) rather than blanket w.async, so
inline-owned conns get the zero-copy body recv the sync path already uses.

The worker still owns h1State for non-promoted conns, so no new races
(this strictly REDUCES cross-goroutine handoff). Async-marked routes are
still promoted at the fresh-parse site before the body, and partial
headers still re-run the async check on completion.
The optionalTier was the only path that set IORING_SETUP_SQPOLL, reachable
when the provided-buffers probe fails on an otherwise-High kernel. That path
is doubly broken: (1) celeris runs one ring per worker, so SQPOLL spawns one
kernel poll thread per worker — N spinning cores that starve the workers
(measured -83% throughput, 75% idle CPU on a 16-worker box); (2) the dormant
SQPOLL submit path has a latent SQ-tail-publish race in Ring.GetSQE (the
shared tail is advanced before the SQE payload is written, safe only because
io_uring_enter is the sync point on the non-SQPOLL path).

optionalTier now uses the task-run completion model like highTier
(DEFER_TASKRUN|SINGLE_ISSUER, or COOP_TASKRUN on 6.0), keeping provided
buffers / multishot / SEND_ZC but never SQPOLL. SQPollIdle returns 0 so the
worker SQPOLL branch is unreachable. Documented the GetSQE SQPOLL-unsafety so
any future SQPOLL work fixes the tail-publish first.

Test updated to assert the new contract (task-run, never SQPOLL).
…omotion

fix(iouring): close post-4k gap — stop sticky async-promotion on split-recv bodies
#356 adaptive promotion was terminal: once a route accumulated
adaptivePromoteStreak slow inline runs it was pinned to async dispatch
forever (adaptiveLearning returns false for promoted routes, and a
promoted route runs async so it is never re-timed). A CPU-bound chain
whose inline WALL-CLOCK briefly crossed adaptivePromoteThreshold under
transient worker contention (not actual blocking) was therefore stuck on
the ~32%-slower async path until process restart — the intermittent
chain-fullstack collapse.

Promotion now expires after adaptivePromoteTTL (5s): isPromoted drops the
route from the promoted set and resets its slow streak once the stamp is
older than the TTL, so the next request runs inline again and is
re-timed. A genuinely-blocking route re-promotes within
adaptivePromoteStreak runs; a transient false-positive stays inline and
re-settles. The clock (nowNano, a test-stubbable package var) is read
only for routes already in the promoted set, so the inline/learning/
settled fast paths are unchanged.
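
A sketch of TTL-based expiry with a stubbable clock, under assumptions: it swaps the promoted set for a mutex-guarded map to stay short, and the `promotions` type and `promote` method are hypothetical. Only `adaptivePromoteTTL`, `isPromoted`, and `nowNano` are names from this message.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const adaptivePromoteTTL = 5 * time.Second

// nowNano is a package var so tests can stub the clock.
var nowNano = func() int64 { return time.Now().UnixNano() }

// promotions is a hypothetical container for promotion stamps and streaks.
type promotions struct {
	mu         sync.Mutex
	stamp      map[string]int64 // route -> promotion time in nanoseconds
	slowStreak map[string]int
}

func newPromotions() *promotions {
	return &promotions{stamp: map[string]int64{}, slowStreak: map[string]int{}}
}

func (p *promotions) promote(route string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.stamp[route] = nowNano()
}

// isPromoted reads the clock only for routes already in the promoted set.
// An expired promotion is dropped and the slow streak reset, so the next
// request runs inline and is timed again.
func (p *promotions) isPromoted(route string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	stamp, ok := p.stamp[route]
	if !ok {
		return false
	}
	if nowNano()-stamp > int64(adaptivePromoteTTL) {
		delete(p.stamp, route)
		p.slowStreak[route] = 0
		return false
	}
	return true
}

func main() {
	p := newPromotions()
	p.promote("/chain")
	fmt.Println(p.isPromoted("/chain")) // still within the TTL
}
```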

Tests: promotion expires + slow-streak reset; de-promoted route settles
when fast; still-blocking route re-promotes after expiry.

NOTE: this reverts promotion at the ROUTING layer. A connection already
promoted to its async dispatch goroutine (cs.asyncPromoted) stays there
until it closes — the worker owns recv but the async->inline conn
handoff is separate, larger work. So long-lived keep-alive conns recover
on reconnect / new conns; full in-place conn recovery is a follow-up.
…otion (#364)

Completes the celeris#364 fix. This PR's first commit made ROUTE promotion
reversible (TTL); this makes the per-CONNECTION promotion reversible too,
so a long-lived keep-alive conn that was promoted to its async dispatch
goroutine returns to the inline fast path when the promoting route
de-promotes — without it, such a conn stayed on the ~32%-slower
blocking-write+handoff path until it closed (the bench scenario only
recovered on reconnect).

Mechanism:
- The worker records the route that forced promotion (h1State.CurrentRoute,
  single-shot recv only) before starting the dispatch goroutine.
- The dispatch goroutine, at its idle park point (asyncInBuf drained, last
  response written, no partial request), checks canRevertToInline: route's
  RouteAsync now false. If so it clears asyncPromoted and exits; the worker
  already owns recv and resumes the inline fast path on the next CQE.
- cs.asyncPromoted becomes atomic.Bool: the goroutine clears it while the
  worker reads it on the recv hot path. The worker's feed path re-checks it
  under asyncInMu (the same lock the goroutine clears it under) before
  appending to asyncInBuf, closing the feed-vs-revert race. Only at a clean
  request boundary (HasPendingData false) so h1State ownership flips back to
  the worker exactly as for a fresh inline conn; #256 bodyRecvPin retained.
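
The feed-vs-revert handshake above can be sketched as below. This is an illustration under assumptions: `connState`, `feed`, `drain`, and `tryRevert` are hypothetical shapes, and h1State ownership and the dispatch goroutine itself are left out. Only `asyncPromoted`, `asyncInMu`, and `asyncInBuf` are names from this message.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// connState models only the fields involved in the revert handshake.
type connState struct {
	asyncPromoted atomic.Bool
	asyncInMu     sync.Mutex
	asyncInBuf    []byte
}

// feed is the worker side. It returns false when the conn is (or has just
// become) inline again, so the worker processes data on the fast path.
func (cs *connState) feed(data []byte) bool {
	if !cs.asyncPromoted.Load() { // lock-free read on the recv hot path
		return false
	}
	cs.asyncInMu.Lock()
	defer cs.asyncInMu.Unlock()
	if !cs.asyncPromoted.Load() { // re-check under the lock closes the race
		return false
	}
	cs.asyncInBuf = append(cs.asyncInBuf, data...)
	return true
}

// drain models the dispatch goroutine consuming its input buffer.
func (cs *connState) drain() {
	cs.asyncInMu.Lock()
	defer cs.asyncInMu.Unlock()
	cs.asyncInBuf = cs.asyncInBuf[:0]
}

// tryRevert is the dispatch-goroutine side at its idle park point. It clears
// asyncPromoted only when the route is no longer async and no input is
// buffered, under the same lock the worker's feed path takes.
func (cs *connState) tryRevert(routeStillAsync bool) bool {
	cs.asyncInMu.Lock()
	defer cs.asyncInMu.Unlock()
	if routeStillAsync || len(cs.asyncInBuf) != 0 {
		return false
	}
	cs.asyncPromoted.Store(false)
	return true
}

func main() {
	cs := &connState{}
	cs.asyncPromoted.Store(true)
	cs.feed([]byte("GET / HTTP/1.1\r\n\r\n"))
	cs.drain()
	fmt.Println(cs.tryRevert(false)) // idle and route de-promoted: revert
	fmt.Println(cs.feed([]byte("x"))) // worker now keeps the data inline
}
```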

Tests (engine, linux): TestAsyncConnRevertsOnRouteDepromotion proves revert
via re-promotion (a still-async conn cannot re-promote); TestAsyncConnRevertRace
hammers promote/feed/revert/re-promote across 64 keep-alive conns. Both pass
under -race; full async-churn UAF suite stays green under -race.
…-promotion

fix: fully reversible adaptive promotion — route TTL + connection revert (#364)
The load-based worker scaler (pause/resume by connection count) is
obsolete on kernel 7.0+ — its concentration premise has reversed (more
workers win at every concurrency, 4-core and 16-core alike) and its
down-scale strands keep-alive throughput on surge-after-quiet (-31% on
get-simple-1024c in the harness sequence). Removing it recovers that
throughput with no regression. Worker pool is now static = numCPU
(Resources.Workers), all workers always active; the adaptive engine's
accept-pause/suspend lifecycle is unaffected.
Pick the START engine from probed io_uring capabilities
(chooseStartEngine): io_uring on bundles-era (6.10+) or the 6.1 fast tier
(DEFER_TASKRUN+SINGLE_ISSUER+multishot+provided-buffers), epoll otherwise.
On kernel 7.0 adaptive now starts on io_uring, capturing the high-conc
keep-alive throughput the old epoll-start default stranded (~+12%).

Standby construction is LAZY: only the start engine is built+Listen-ed
eagerly; the other is constructed on the first switch that needs it. When
the start engine is already best and never switches, the standby is never
built — cutting the dual-engine tax from ~7% to ~0.9% (same-binary
interleaved at 1024c on 7.0).

Conns-per-worker UP/DOWN switching is gated OFF in production: pinned
conns never migrate, so the start engine decides keep-alive throughput,
and the down-revert otherwise fired on idle/warmup dips and stranded
load. The io_uring error-rate safety revert stays always-on. The
conns-per-worker controller + multi-signal telemetry (conns/worker,
accept rate, bytes/req via new per-engine accept/close/byte counters) are
kept, gated, for a future middle-tier kernel with a real crossover (to be
validated by a multi-kernel sweep). Old CPU-bias score machinery removed.

Switch-safety invariants unchanged (resume-before-pause, synchronous
PauseAccept/H2-dial-RST, ASYNC_CANCEL, driver-FD refusal, freezeState).
… switch

Redesign the adaptive start-engine decision around connection pinning: an
established conn cannot migrate between epoll and io_uring, so the START engine
decides keep-alive throughput and the workload concurrency is unknowable at
Listen() time. chooseStartEngine now gates only on t0-knowable facts:

  env override -> io_uring not viable (kernel fast-tier AND RLIMIT_MEMLOCK can
  fund the workers) -> Protocol==H2C -> WorkloadHint==HighConcurrency -> default.

The default flips from io_uring-on-modern-kernels to EPOLL: every server ramps
from zero connections (the low-concurrency regime where epoll wins on both
throughput and tail latency). io_uring starts only on an explicit
WorkloadHint=HighConcurrency (new operator field) when kernel + memlock allow.
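
The gate order above can be sketched as a pure function. Several parts are assumptions: the `startInputs` struct and its fields are hypothetical, the env override is modelled as an optional engine value, and the H2C gate is assumed to select epoll (the message lists the gate but not its outcome).

```go
package main

import "fmt"

type Engine int

const (
	Epoll Engine = iota
	IOUring
)

type WorkloadHint int

const (
	Unspecified WorkloadHint = iota
	LowConcurrency
	HighConcurrency
)

// startInputs holds the t0-knowable facts. Hypothetical shape for this sketch.
type startInputs struct {
	envOverride         *Engine // nil when no override is set
	kernelFastTier      bool
	memlockFundsWorkers bool
	h2c                 bool
	hint                WorkloadHint
}

// chooseStartEngine applies the gates in the order listed above.
func chooseStartEngine(in startInputs) Engine {
	if in.envOverride != nil {
		return *in.envOverride // explicit operator override wins
	}
	if !(in.kernelFastTier && in.memlockFundsWorkers) {
		return Epoll // io_uring not viable
	}
	if in.h2c {
		return Epoll // assumption: no io_uring edge on h2c
	}
	if in.hint == HighConcurrency {
		return IOUring
	}
	return Epoll // default: servers ramp from zero connections
}

func main() {
	viable := startInputs{kernelFastTier: true, memlockFundsWorkers: true}
	fmt.Println(chooseStartEngine(viable) == Epoll)
	viable.hint = HighConcurrency
	fmt.Println(chooseStartEngine(viable) == IOUring)
}
```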

New helpers/fields:
- iouring.MaxWorkersForMemlock(): the memlock worker ceiling, exported so the
  start decision avoids io_uring's silent 1-worker collapse proactively, not
  just on construction failure. capWorkersToMemlock derives from it.
- resource.Resources.WorkloadHint + root celeris.WorkloadHint
  (Unspecified/LowConcurrency/HighConcurrency).

Runtime switch (controller) re-enabled but constrained: only on the epoll-start
path with io_uring viable and a non-h2c protocol, it promotes NEW connections to
io_uring when conns/worker sustains the crossover. Pinning keeps the switch
inert for a pure keep-alive burst; it helps ramps/churn. The load-driven
down-revert is disabled (pinning makes it harmful); the io_uring error-revert
stays always-on. Thresholds tuned to the epoll-vs-io_uring sweep: up 20->24,
high-watermark 32->48, large-payload suppression 16384->8192 bytes.

Empirical basis (msa2-server, kernel 7.0, real NIC): epoll wins <=32 conns
(+~20%, ~40% lower tail), tie 64-256, io_uring wins >=~384 conns (~24/worker,
+8-13%); io_uring's edge is h1-small-payload only (h2c/large payloads tie) and
collapses under low RLIMIT_MEMLOCK (1 worker ~= 1/5 throughput).

Validated on the cluster: resource/adaptive/iouring/epoll/root suites pass on
real io_uring; default adaptive starts epoll, env/hint force io_uring, and a
1024c load fires the epoll->io_uring switch.

Cross-engine connection migration (transplant pinned conns) deferred to a
v1.6.0 spike (#383): only H1-idle epoll->io_uring is feasible; H1-mid-request
and H2 are impossible under the current parser/HPACK/stream architecture.
…+ UsesDriver

Close the async/sync handler review's footguns:

#1 Server.AsyncHandlers() reports the EFFECTIVE async state
   (config.AsyncHandlers || router.hasAsyncRoutes()) instead of the raw config
   flag. WithEngine drivers select their netpoll-park fast path from this, so the
   recommended "AsyncHandlers=false + mark DB routes .Async()" idiom no longer
   silently drops the driver onto its slow busy-spin mini-loop. Only the driver
   registry consults this method, so the change is targeted. The value is read at
   driver construction, so open WithEngine drivers AFTER registering .Async()
   routes (or set AsyncHandlers=true); documented on the field.

#3 adaptiveBlockingThreshold (2ms): a single unambiguously-blocking inline run
   promotes the adaptive route immediately (router.promoteRouteImmediate),
   skipping the adaptivePromoteStreak (8) hysteresis, so a forgotten-.Async()
   blocking handler stalls a worker for at most one request. The 300us/8-streak
   path still owns the 300us-2ms band; a CPU chain cannot cross 2ms under jitter.

#4 Route.UsesDriver() - intent-revealing alias for .Async() on driver routes.

#2 Config.AsyncHandlers doc rewritten (effective-state behavior, construction-
   order caveat, recommend AsyncHandlers=true OR .Async()/.UsesDriver() for
   driver routes; the adaptive net only auto-promotes handlers slower than 300us).
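
The latency bands above (under 300us, 300us to 2ms, over 2ms) can be sketched as a classifier plus a streak counter. The constants carry the values quoted in this message; the `route` struct, `classify`, and the exact boundary comparisons are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

const (
	adaptivePromoteThreshold  = 300 * time.Microsecond
	adaptivePromoteStreak     = 8
	adaptiveBlockingThreshold = 2 * time.Millisecond
)

type verdict int

const (
	fast     verdict = iota // below the promote threshold
	slow                    // 300us-2ms band: needs a streak to promote
	blocking                // over 2ms: unambiguously blocking
)

func classify(d time.Duration) verdict {
	switch {
	case d > adaptiveBlockingThreshold:
		return blocking
	case d > adaptivePromoteThreshold:
		return slow
	default:
		return fast
	}
}

// route is a hypothetical per-route record for this sketch.
type route struct {
	slowStreak int
	promoted   bool
}

// recordInlineRun applies the two promotion paths to one timed inline run.
func (r *route) recordInlineRun(d time.Duration) {
	switch classify(d) {
	case blocking:
		r.promoted = true // immediate promote, no hysteresis
	case slow:
		r.slowStreak++
		if r.slowStreak >= adaptivePromoteStreak {
			r.promoted = true
		}
	default:
		r.slowStreak = 0
	}
}

func main() {
	r := &route{}
	r.recordInlineRun(20 * time.Microsecond)
	fmt.Println(r.promoted) // CPU-bound run stays inline
	r.recordInlineRun(3 * time.Millisecond)
	fmt.Println(r.promoted) // one blocking run promotes
}
```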

Driver benchmark (msa2-server, 128c, footgun config = AsyncHandlers=false +
per-route .Async(), before vs after #1): the fix recovers the async win and
~halves p99, matching the global-AsyncHandlers=true fast path -
  redis     iouring  87k->107k (+23%)  p99 2.8->1.4ms ; epoll 86k->107k
  memcached iouring  92k->136k (+48%)  p99 3.0->1.2ms
  postgres  iouring  62k-> 84k (+35%)  p99 3.1->1.9ms

Tests: 8 white-box unit (async_improvements_test.go) + 1 epoll integration
(async_promote_integration_linux_test.go: single >2ms run promotes immediately);
full celeris + driver suites green on real io_uring.
celeris.Version was stale at 1.5.0 (unchanged since the 1.5.0 tag, so even
v1.5.2 shipped "1.5.0"). Bump to 1.5.3. Bump the four publishable middleware
submodules' `require github.com/goceleris/celeris` from v1.4.4 to v1.5.3 so
the tagged submodules resolve the matching core (release.yml's tag-submodules
job warns when they drift); the local `replace => ../../` keeps in-tree
builds against the live core.
The Supported Versions table + policy still named the 1.4.x line as
supported even though 1.5.0-1.5.2 had shipped. Mark >= 1.5.0 supported,
1.4.x and earlier unsupported, and add a v1.5.x Security Improvements
section (io_uring gen-tagged-CQE UAF hardening, H1 request-smuggling
hardening, cpuMon + epoll-detach data-race fixes, middleware/secure COEP +
X-Download-Options default-off behavior change, Go 1.26.4 toolchain bump).
TestCheckCapSysNiceIsSideEffectFree lived in the untagged probe_test.go but
called getNice(), which is defined only in probe_caps_linux_test.go
(//go:build linux). The package failed to compile on darwin/non-linux, so
`go test ./...` / `go vet ./...` broke for local devs (CI was unaffected:
linux jobs compile it, the macos job only runs go build). Move the test next
to getNice under the linux tag.
- middleware/metrics: prometheus/common v0.68.1 -> v0.69.0 (the published
  submodule's only outdated direct dep; supersedes dependabot #381).
- test/drivercmp/memcached + test/perfmatrix: bradfitz/gomemcache -> latest;
  perfmatrix also goceleris/loadgen v1.4.8 -> v1.4.9 (test-only modules).
- CI: actions/checkout v6 -> v7 across ci/release/drivers (supersedes
  dependabot #382).
Root module + the other three published middleware modules already pin
current direct deps. metrics builds+tests green on 0.69.0; actionlint clean.
…sting)

The adaptive package is linux-tagged, so golangci-lint only sees it on the
linux CI runner — and milestone/v1.5.3 never had a prior CI run (ci.yml gates
on PR/push to main), so four findings slipped in:
- controller.go:92 + engine.go:137: gofmt formatting.
- engine.go:256: ineffectual local 'startType = engine.Epoll' (the struct
  field e.startType is the one actually read at activeIsPrimary); drop it.
- start_test.go:26: withMemlock param 'max' shadowed the builtin (revive
  redefines-builtin-id); rename to maxWorkers.
Verified: GOOS=linux golangci-lint ./... = 0 issues, cross-compile build+vet clean.
@FumingPower3925 FumingPower3925 merged commit e4da508 into main Jun 21, 2026
31 checks passed
@FumingPower3925 FumingPower3925 deleted the milestone/v1.5.3 branch June 21, 2026 15:43